MonolayerCultures / src / Oligodendroglia / [RC17]Approximation_Models _for _Oligodendroglia.Rmd
[RC17]Approximation_Models _for _Oligodendroglia.Rmd
Raw
---
title: '[RC17]Approximation Models for Oligodendroglia'
author: "Nina-Lydia Kazakou"
date: "10/03/2022"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Set-up

### Load libraries

```{r message=FALSE, warning=FALSE}
library(SingleCellExperiment)
library(Seurat)
library(tidyverse)
library(Matrix)
library(scales)
library(cowplot)
library(RCurl)
library(scDblFinder)
library(Matrix)
library(ggplot2)
library(edgeR)
library(dplyr)
library(ggsci)  
library(here)
library(gtools)
```

### Colour Palette

```{r load_palette}
mypal <- pal_npg("nrc", alpha = 0.7)(10)
mypal2 <-pal_tron("legacy", alpha = 0.7)(7)
mypal3<-pal_lancet("lanonc", alpha = 0.7)(9)
mypal4<-pal_simpsons(palette = c("springfield"), alpha = 0.7)(16)
mypal5 <- pal_rickandmorty(palette = c("schwifty"), alpha = 0.7)(6)
mypal6 <- pal_futurama(palette = c("planetexpress"), alpha = 0.7)(5)
mypal7 <- pal_startrek(palette = c("uniform"), alpha = 0.7)(5)
mycoloursP<- c(mypal, mypal2, mypal3, mypal4)
```

### Load object

```{r}
oligos <- readRDS(here("Data", "Processed", "Oligodendroglia", "oligos_initial_srt.rds"))
```

# Generate & Save DimPlots for all the calculated resolutions:
```{r eval=FALSE}
plot_list_func <- function(seur_obj,
                           col_pattern, 
                           plot_cols, 
                           clust_lab = TRUE,
                           label_size = 8,
                           num_col = 4,
                           save_dir = getwd(),
                           width=7,
                           height=5){
  extr_res_col <- grep(pattern = col_pattern, names(seur_obj@meta.data))
  res_names <- names(seur_obj@meta.data[extr_res_col])
  # gtools function, sorts gene_names alphanumeric:
  res_names <- mixedsort(res_names) 
  plot_l <-list()
  for(i in 1: length(res_names)){
    pdf(paste0(save_dir, 
               res_names[i], "_umap.pdf"), width=width, height=height)
    dim_plot <- DimPlot(seur_obj, 
                        reduction = "umap", 
                        cols= plot_cols,
                        group.by = res_names[i],
                        label = clust_lab,
                        label.size = label_size) + NoLegend()
    print(dim_plot)
    dev.off()
    print(dim_plot)
  }
}


dir.create(here("outs", "Processed", "Oligodendroglia", "DimPlots_AllRes"))

plot_list_func(seur_obj = oligos,
               col_pattern = "RNA_snn_res.",
               plot_cols = mycoloursP[6:40],
               save_dir = here("outs", "Processed", "Oligodendroglia", "DimPlots_AllRes"))
```


# ClusterTree
https://cran.r-project.org/web/packages/clustree/vignettes/clustree.html

A clustering tree is different in that it visualises the relationships between at a range of resolutions.
To build a clustering tree I need to look at how cells move as the clustering resolution is increased. Each cluster forms a node in the tree and edges are constructed by considering the cells in a cluster at a lower resolution (say k=2) that end up in a cluster at the next highest resolution (say k=3). By connecting clusters in this way we can see how clusters are related to each other, which are clearly distinct and which are unstable.

```{r message=FALSE, warning=FALSE}
library(clustree)
```

```{r}
combined_clust_res <- cbind(k.0.5 = oligos@meta.data$RNA_snn_res.0.05,
                            k.1 = oligos@meta.data$RNA_snn_res.0.1,
                            k.2 = oligos@meta.data$RNA_snn_res.0.2,
                            k.3 = oligos@meta.data$RNA_snn_res.0.3,
                            k.4 = oligos@meta.data$RNA_snn_res.0.4, 
                            k.5 = oligos@meta.data$RNA_snn_res.0.5, 
                            k.6 = oligos@meta.data$RNA_snn_res.0.6,
                            k.7 = oligos@meta.data$RNA_snn_res.0.7,
                            k.8 = oligos@meta.data$RNA_snn_res.0.8,
                            k.9 = oligos@meta.data$RNA_snn_res.0.9,
                            k.10 = oligos@meta.data$RNA_snn_res.1)
```

```{r fig.height=10, fig.width=10, message=FALSE, warning=FALSE}
clustree(combined_clust_res, prefix="k.", edge_arrow=FALSE)
```

```{r eval=FALSE}
dir.create(here("outs", "Processed", "Oligodendroglia", "ClusterTree"))

clust_tree_01 <- clustree(combined_clust_res, prefix="k.", edge_arrow=FALSE)

# Step 1: Call the pdf command to start the plot
pdf(file = here("outs", "Processed", "Oligodendroglia", "ClusterTree", "ClusterTree_01.pdf"),
    width = 10,
    height = 10)
# Step 2: Create the plot with R code
print(clust_tree_01)
# Step 3: Run dev.off() to create the file!
dev.off()
```


In this plot, the size of each node is related to the number of samples in each cluster and the color indicates the clustering resolution.
The edges are coloured according to the number of samples they represent and the transparency shows the incoming node proportion, the number of samples in the edge divided by the number of samples in the node it points to.

```{r clustree, fig.height=10, fig.width=10}
clustree(
  oligos,
  prefix = paste0("RNA", "_snn_res."),
  exprs = c("data", "counts", "scale.data"),
  assay = NULL
)
```

```{r eval=FALSE}
clust_tree_02 <- clustree(
  oligos,
  prefix = paste0("RNA", "_snn_res."),
  exprs = c("data", "counts", "scale.data"),
  assay = NULL
)

pdf(file = here("outs", "Processed", "Oligodendroglia", "ClusterTree", "ClusterTree_02.pdf"),
    width = 10,
    height = 10)
print(clust_tree_01)
dev.off()
```


# Sihlouette; Silhouette coefficient

The silhouette analysis measures how well an observation is clustered and it estimates the average distance between clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters.

For each observation i, the silhouette width sil is calculated as follows:
* For each observation i, calculate the average dissimilarity ai between i and all other points of the cluster to which i belongs.
* For all other clusters C, to which i does not belong, calculate the average dissimilarity d(i,C) of i to all observations of C. The smallest of these d(i,C) is defined as bi=minCd(i,C). The value of bi can be seen as the dissimilarity between i and its neighbor cluster, i.e., the nearest one to which it does not belong.
* Finally the silhouette width of the observation i is defined by the formula: Si=(bi)/max(ai,bi).

Silhouette width can be interpreted as follow:
* Observations with a large Si (almost 1) are very well clustered.
* A small Si (around 0) means that the observation lies between two clusters.
* Observations with a negative Si are probably placed in the wrong cluster.

### Sihlouette plots from srt object 
https://github.com/satijalab/Integration2019/blob/master/analysis_code/integration/integration_metrics.R#L36

```{r}
library(cluster)
```

```{r}
dims <- 1:30
dist.matrix <- dist(x = Embeddings(object = oligos[["pca"]])[, dims])
clusters <- oligos$RNA_snn_res.0.05
sil <- silhouette(x = as.numeric(x = as.factor(x = clusters)), dist = dist.matrix)
oligos$sil <- sil[, 3]
```

```{r mixing_metric}
max.k <- 300
mm <- max.k - MixingMetric(object = oligos, grouping.var = "Sample", reduction = "pca", dims = dims, max.k = max.k)
```

```{r}
DefaultAssay(object = oligos) <- "RNA"
```

```{r local_structure_preservation}
ls <- LocalStruct(object = oligos, grouping.var = "Sample", reduction = "pca", reduced.dims = dims, orig.dims = 1:12)
ls <- unname(obj = unlist(x = ls))

all.metrics <- list(
  silhouette = oligos$sil, 
  mixing.metric = mm,
  local.struct = ls
)

head(sil, 2)
```

***Note: silhouette() returns an object, sil, of class silhouette which is an n x 3 matrix with attributes. For each observation i, sil[i,] contains the cluster to which i belongs as well as the neighbor cluster of i (the cluster, not containing i, for which the average dissimilarity between its observations and i is minimal), and the silhouette width s(i) of the observation. 
The colnames correspondingly are c("cluster", "neighbor", "sil_width"). 

```{r Silhouette_example, fig.height=4, fig.width=4}
plot(sil, border = NA)
plot(sil, border = NA, do.n.k =FALSE, do.clus.stat= FALSE)
```

### Silhouette plots from sce object:
***Note: For the wraper function of sil_plots to work, I need to begin from a resolution that has at least two distinct clusters, so I will need to exclude res.0.01, as well as the AllCelles_res, to avoid confusion
```{r}
oligos_sce <- as.SingleCellExperiment(oligos)

oligos_sce$RNA_snn_res.0.01 <-NULL
oligos_sce$RC17_AllCell_ClusterID <-NULL

colnames(colData(oligos_sce))
```


```{r}
sil_plot <- function(sce_obj, reduction = "PCA",
                     col_pattern = "RNA_snn_res", 
                     plot_cols, 
                     clust_lab = TRUE,
                     label_size = 8,
                     num_col = 4,
                     save_dir = getwd(),
                     width=7,
                     height=5){
  res_col <- grep(pattern = col_pattern, names(colData(sce_obj)))
  names_col <- names(colData(sce_obj))[res_col]
  # gtools function, sorts gene_names alphanumeric:
  names_col <- mixedsort(names_col)
  met_dat <- as.data.frame(colData(oligos_sce))
  distance <- dist(reducedDim(sce_obj, reduction))
  for(i in 1: length(names_col)){
    clust <- met_dat[[names_col[i]]]
    clust_int <- as.integer(paste0(clust))
    sil <- silhouette(clust_int, dist = distance)
    
    pdf(paste0(save_dir, 
               "_", names_col[i], "_sil.pdf"), width=width, height=height)
    plot(sil, border = NA)
    dev.off()
    plot(sil, border = NA)
    if(i == 1){
      av_sil_df <- data.frame(res = names_col[i], 
                              av_sil_w = summary(sil)$avg.width)
    }else{
      append_df <- data.frame(res = names_col[i], 
                              av_sil_w = summary(sil)$avg.width)
      av_sil_df <- rbind(av_sil_df, append_df)
    }
  }
  return(av_sil_df)
}
```

```{r include=FALSE}
dir.create(here("outs", "Processed", "Oligodendroglia", "Silhouette_Plots"))

av_df <- sil_plot(sce_obj = oligos_sce, 
                  col_pattern = "RNA_snn_res.",
                  plot_cols = mycoloursP[6:40],
                  save_dir = here("outs", "Processed", "Oligodendroglia", "Silhouette_Plots"))
```

```{r}
av_df$num_res <- as.numeric(sapply(strsplit(av_df$res,"res."), `[`, 2))
ggplot(av_df, aes(x = num_res, y = av_sil_w)) + 
  geom_line(color="grey") + 
  geom_point(shape=21, color="black", fill=mycoloursP[11], size=6) + 
  theme_bw() +
  ylab("Average silhouette width") +
  scale_x_continuous(name = "Cluster resolution", 
                     breaks = av_df$num_res) +
  theme(axis.text.x = element_text(angle = 90))
```

```{r eval=FALSE}
av_sil_res_plot <- ggplot(av_df, aes(x = num_res, y = av_sil_w)) + 
  geom_line(color="grey") + 
  geom_point(shape=21, color="black", fill=mycoloursP[4], size=6) + 
  theme_bw() +
  ylab("Average silhouette width") +
  scale_x_continuous(name = "Cluster resolution", 
                     breaks = av_df$num_res) +
  theme(axis.text.x = element_text(angle = 90))

pdf(file = here("outs", "Processed", "Oligodendroglia", "Silhouette_Plots", "av_sil_res_plot.pdf"),
    width = 10,
    height = 10)
print(av_sil_res_plot)
dev.off()
```

## Approximate Silhouette 

ApproxSil is useful for large datasets. It is used to to select the most stable clusters based on stats. It performs the calculations on the PC coordinates. The silhouette width is a general-purpose method for evaluating the separation between clusters but requires calculating the average distance between pairs of observations within or between clusters.

```{r}
library(bluster)
```

```{r}
clust <- as.integer(paste0(colData(oligos_sce)$RNA_snn_res.0.05))
sil_approx <- approxSilhouette(reducedDim(oligos_sce, "PCA"), clusters=clust)
```

**approxSilhouette** approximates the average distances for faster computation in large datasets. 
For a given observation, let \tilde D be the approximate average distance to all cells in cluster X. This is defined as the square root of the sum of:
*The squared distance from the current observation to the centroid of cluster X. This is most accurate when the observation is distant to X relative to the latter's variation.
*The summed variance of all variables across observations in cluster X. This is most accurate when the observation lies close to the close to the centroid of X. This is also equivalent to the root-square-mean distance from the current observation to all cells in X.
The approximate silhouette width for each cell can then be calculated with the relevant two values of \tilde D, computed by setting X to the cluster of the current cell or the closest other cluster.

```{r ApproxSil_width_example}
sil_data <- as.data.frame(sil_approx)
sil_data$closest <- factor(ifelse(sil_data$width > 0, clust, sil_data$other))
sil_data$cluster <- factor(clust)
ggplot(sil_data, aes(x=cluster, y=width, colour=closest)) + 
  ggbeeswarm::geom_quasirandom(method="smiley") + theme_bw(20) +
  xlab("Cluster (resolution 0.05)")
```

```{r ApproxSil_purity_example}
pure_0_1 <- neighborPurity(reducedDim(oligos_sce, "PCA"), clusters=clust)
pure_data <- as.data.frame(pure_0_1)
pure_data$maximum <- factor(pure_data$maximum)
pure_data$cluster <- factor(clust)
ggplot(pure_data, aes(x=cluster, y=purity, colour=maximum)) + 
  ggbeeswarm::geom_quasirandom(method="smiley") +
  theme_bw(20) +
  xlab("Cluster (resolution 0.05)")
```

```{r ApproxSil_width, eval=FALSE}
approx_sil <- function(sce_obj, reduction = "PCA",
                       col_pattern = "RNA_snn_res", 
                       plot_cols, 
                       clust_lab = TRUE,
                       label_size = 8,
                       save_dir = getwd(),
                       width=7,
                       height=5){
  res_col <- grep(pattern = col_pattern, names(colData(sce_obj)))
  names_col <- names(colData(sce_obj))[res_col]
  # gtools function, sorts gene_names alphanumeric:
  names_col <- mixedsort(names_col)
  met_dat <- as.data.frame(colData(oligos_sce))
  for(i in 1: length(names_col)){
    clust <- met_dat[[names_col[i]]]
    clust_int <- as.integer(paste0(clust))
    
    sil_approx <- approxSilhouette(reducedDim(sce_obj, reduction), 
                                   clusters = clust_int)
    sil_data <- as.data.frame(sil_approx)
    sil_data$closest <- factor(ifelse(sil_data$width > 0, clust_int, sil_data$other))
    sil_data$cluster <- factor(clust_int)
    
    apr_sil_plot <-ggplot(sil_data, aes(x=cluster, y=width, colour=closest)) + 
      ggbeeswarm::geom_quasirandom(method="smiley") + theme_bw(20) +
      xlab(names_col[i])
    
    
    pdf(paste0(save_dir, 
               "_", names_col[i], "_AppoxSil_width.pdf"), width=width, height=height)
    
    
    print(apr_sil_plot)
    
    dev.off()
    
    print(apr_sil_plot)
  }
  print("Done")
}
```

```{r eval=FALSE}
dir.create(here("outs", "Processed", "Oligodendroglia", "ApproxSil_width_Plots"))

approx_sil(sce_obj = oligos_sce, 
           col_pattern = "RNA_snn_res.",
           plot_cols = mycoloursP[6:40],
           save_dir = here("outs", "Processed", "Oligodendroglia", "ApproxSil_width_Plots"))
```


```{r ApproxSil_purity}
clu_pure <- function(sce_obj, reduction = "PCA",
                     col_pattern = "RNA_snn_res", 
                     plot_cols, 
                     clust_lab = TRUE,
                     label_size = 8,
                     save_dir = getwd(),
                     width=7,
                     height=5){
  res_col <- grep(pattern = col_pattern, names(colData(sce_obj)))
  names_col <- names(colData(sce_obj))[res_col]
  # gtools function, sorts gene_names alphanumeric:
  names_col <- mixedsort(names_col)
  met_dat <- as.data.frame(colData(oligos_sce))
  for(i in 1: length(names_col)){
    clust <- met_dat[[names_col[i]]]
    clust_int <- as.integer(paste0(clust))
    
    pure <- neighborPurity(reducedDim(sce_obj, reduction), clusters = clust_int)
    pure_data <- as.data.frame(pure)
    pure_data$maximum <- factor(pure_data$maximum)
    pure_data$cluster <- factor(clust_int)
    
    
    pure_plot <- ggplot(pure_data, aes(x=cluster, y=purity, colour=maximum)) + 
      ggbeeswarm::geom_quasirandom(method="smiley") +
      theme_bw(20) +
      xlab(names_col[i])
    
    
    pdf(paste0(save_dir, 
               "_", names_col[i], "_AppoxSil_purity.pdf"), width=width, height=height)
    
    print(pure_plot)
    
    dev.off()
    
    print(pure_plot)
  }
  print("Done")
}
```

```{r eval=FALSE}
dir.create(here("outs", "Processed", "Oligodendroglia", "ApproxSil_purity_Plots"))

clu_pure(sce_obj = oligos_sce, 
         col_pattern = "RNA_snn_res.",
         plot_cols = mycoloursP[6:40],
         save_dir = here("outs", "Processed", "Oligodendroglia", "ApproxSil_width_Plots"))
```

```{r}
sessionInfo()
```